Change your assumptions (hard, need more stats)
Transform \(y\), \(x\), or both
How is job difficulty related to salary?
Mean function may be nonlinear. Residuals have slightly heavy-tailed distribution.
Variance seems to be increasing. No troubling influential values.
Let \(U\) be a strictly positive variable; then the power family of transformations is defined by
\[\psi(U, \lambda) = U^\lambda\]
with the convention that \(\lambda = 0\) corresponds to \(\log(U)\).
Approach:
Choose \(\widehat{\lambda}\) to minimize RSS(\(\lambda\))
Use an inverse response plot
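As a hedged sketch of the idea that invResPlot automates: regress the fitted values on \(\psi(y, \lambda)\) over a grid of \(\lambda\) values and keep the one with the smallest RSS. The code below uses the built-in mtcars data as a stand-in (the illustration only needs some fitted model, not the salary data).

```r
# Manual inverse-response-plot search (illustrative sketch):
# for each candidate lambda, regress the fitted values on psi(y, lambda)
# and record the residual sum of squares; the best lambda minimizes RSS.
psi <- function(u, lambda) if (lambda == 0) log(u) else u^lambda

fit  <- lm(mpg ~ wt, data = mtcars)   # mtcars is a stand-in, not salarygov
y    <- mtcars$mpg
yhat <- fitted(fit)

lambdas <- c(-1, 0, 0.5, 1)
rss <- sapply(lambdas, function(l) deviance(lm(yhat ~ psi(y, l))))
data.frame(lambda = lambdas, RSS = rss)
```

For an lm fit, `deviance()` returns the residual sum of squares, so this mirrors the lambda/RSS table that invResPlot prints.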
Application:
Use the invResPlot function found in the car package
library(car)
invResPlot(salary_mod) # prints lambda and RSS in console
##       lambda       RSS
## 1  0.2533122 417350444
## 2 -1.0000000 598392675
## 3  0.0000000 425733469
## 4  1.0000000 492062225
Approach 1: Create a new column in the data frame
library(dplyr)
library(broom) # for tidy()
salarygov <- mutate(salarygov, lmaxsalary = log(MaxSalary))
log_mod1 <- lm(lmaxsalary ~ Score, data = salarygov)
tidy(log_mod1)
##          term    estimate    std.error statistic       p.value
## 1 (Intercept) 7.825242067 1.864010e-02 419.80678  0.000000e+00
## 2       Score 0.001889238 3.695849e-05  51.11783 3.553568e-199
Approach 2: Apply the transformation in the lm formula
log_mod2 <- lm(log(MaxSalary) ~ Score, data = salarygov)
tidy(log_mod2)
##          term    estimate    std.error statistic       p.value
## 1 (Intercept) 7.825242067 1.864010e-02 419.80678  0.000000e+00
## 2       Score 0.001889238 3.695849e-05  51.11783 3.553568e-199
The mean function appears to be linear and the residuals are well-approximated by the normal distribution.
There are no influential points and the variance appears to be constant.
We can still use predict to make predictions…
newdata <- data.frame(Score = 469) # avg. difficulty score
cip <- predict(log_mod2, newdata = newdata, interval = "confidence")
cip
##        fit      lwr      upr
## 1 8.711295 8.697832 8.724757
but we need to back-transform to the original salary scale
exp(cip)
##        fit      lwr      upr
## 1 6071.097 5989.913 6153.381
We can automate the selection of the power transformation using the modified power family originally defined by Box and Cox (1964)
\[ \psi_M(Y, \lambda_y) = \begin{cases} \text{gm}(Y)^{1 - \lambda_y} \times (Y^{\lambda_y} - 1)/\lambda_y & \lambda_y \ne 0\\ \text{gm}(Y) \times \log(Y) & \lambda_y = 0 \end{cases} \]
where \(\text{gm}(Y) = \exp \left( \sum \log(y_i)/n \right)\) is the geometric mean of \(Y\)
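As a consistency check, the \(\lambda_y = 0\) branch is the limit of the \(\lambda_y \ne 0\) branch:

\[
\lim_{\lambda_y \to 0} \text{gm}(Y)^{1-\lambda_y} \, \frac{Y^{\lambda_y} - 1}{\lambda_y}
= \text{gm}(Y) \lim_{\lambda_y \to 0} \frac{e^{\lambda_y \log(Y)} - 1}{\lambda_y}
= \text{gm}(Y) \times \log(Y),
\]

since \(e^{\lambda_y \log(Y)} - 1 \approx \lambda_y \log(Y)\) for small \(\lambda_y\). The \(\text{gm}(Y)\) factor keeps \(\psi_M(Y, \lambda_y)\) on a comparable scale across values of \(\lambda_y\), so the resulting RSS values can be compared directly.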
- Transforming for normality of residuals
- \(\lambda\) can be estimated via maximum likelihood
boxCox(salary_mod) # to get the plot
summary(powerTransform(salary_mod)) # print the estimate
## bcPower Transformation to Normality ## ## Est.Power Std.Err. Wald Lower Bound Wald Upper Bound ## Y1 0.1292 0.0647 0.0025 0.256 ## ## Likelihood ratio tests about transformation parameters ## LRT df pval ## LR test, lambda = (0) 3.927686 1 0.04749724 ## LR test, lambda = (1) 179.872520 1 0.00000000
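The estimate that powerTransform reports maximizes the Box-Cox profile log-likelihood. A hedged sketch of that computation done by hand, again on the stand-in mtcars data (the helper `bc_loglik` is illustrative, not part of the car package):

```r
# Box-Cox profile log-likelihood, up to an additive constant:
# -n/2 * log(RSS(lambda)/n), where RSS(lambda) comes from regressing
# the scaled power transform psi_M(y, lambda) on the model's predictors.
# bc_loglik() is a hypothetical helper for illustration only.
bc_loglik <- function(fit, lambda) {
  y  <- model.response(model.frame(fit))
  n  <- length(y)
  gm <- exp(mean(log(y)))                     # geometric mean of y
  z  <- if (lambda == 0) gm * log(y)
        else gm^(1 - lambda) * (y^lambda - 1) / lambda
  rss <- deviance(lm(z ~ model.matrix(fit) - 1))
  -n / 2 * log(rss / n)
}

fit <- lm(mpg ~ wt, data = mtcars)            # stand-in, not salary_mod
ll  <- sapply(seq(-2, 2, by = 0.5), function(l) bc_loglik(fit, l))
```

Plotting `ll` against the lambda grid reproduces the shape of the curve that boxCox() draws; the MLE is the lambda where the curve peaks.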
How are tree height and tree diameter related for the Western red cedar?
invTranPlot(Height ~ Dbh, data = ufcwc) # prints lambda and RSS in console
##        lambda      RSS
## 1  0.04787804 152122.8
## 2 -1.00000000 197352.2
## 3  0.00000000 152232.3
## 4  1.00000000 193739.7
Approach 1: Create a new column in the data frame
ufcwc <- mutate(ufcwc, ldbh = log(Dbh))
rc_mod1 <- lm(Height ~ ldbh, data = ufcwc)
tidy(rc_mod1)
##          term  estimate std.error statistic      p.value
## 1 (Intercept) -463.3144 32.437870 -14.28313 6.505273e-29
## 2        ldbh  119.5192  5.531705  21.60621 5.761417e-46
Approach 2: Apply the transformation in the lm formula
rc_mod2 <- lm(Height ~ log(Dbh), data = ufcwc)
tidy(rc_mod2)
##          term  estimate std.error statistic      p.value
## 1 (Intercept) -463.3144 32.437870 -14.28313 6.505273e-29
## 2    log(Dbh)  119.5192  5.531705  21.60621 5.761417e-46
The mean function appears to be linear and the residuals are well-approximated by the normal distribution.
There are no influential points and the variance appears to be roughly constant.
If you added a new column to the data frame…
newdata <- data.frame(ldbh = log(600))
predict(rc_mod1, newdata = newdata, interval = "prediction")
##        fit      lwr     upr
## 1 301.2415 234.8101 367.673
If you transformed Dbh in the model formula…
newdata2 <- data.frame(Dbh = 600)
predict(rc_mod2, newdata = newdata2, interval = "prediction")
##        fit      lwr     upr
## 1 301.2415 234.8101 367.673
How is the volume of a ship's cargo related to the time required to load and unload the cargo?
Approach 1: Transform the predictor first (inverse transformation plot), then transform the response (inverse response plot)
Approach 2: Transform X and Y simultaneously using the Box-Cox procedure
invTranPlot(Time ~ Tonnage, data = glakes, lambda = c(-1, -.5, -.33, 0, .33, .5, 1))
##       lambda       RSS
## 1  0.9591321  3313.093
## 2 -1.0000000 13096.852
## 3 -0.5000000 10533.037
## 4 -0.3300000  9446.566
## 5  0.0000000  7167.611
## 6  0.3300000  5064.994
## 7  0.5000000  4240.574
## 8  1.0000000  3319.425
cargo_mod1 <- lm(Time ~ I(Tonnage^.33), data = glakes)
invResPlot(cargo_mod1, lambda = c(-1, -.5, -.33, 0, .33, .5, 1))
##       lambda      RSS
## 1  0.2569068 2681.954
## 2 -1.0000000 4833.227
## 3 -0.5000000 3606.552
## 4 -0.3300000 3268.026
## 5  0.0000000 2804.903
## 6  0.3300000 2692.605
## 7  0.5000000 2802.512
## 8  1.0000000 3817.548
cargo_mod2 <- lm(log(Time) ~ I(Tonnage^.33), data = glakes)
The mean function appears to be linear and the residuals are well-approximated by the normal distribution.
There are no influential points and the variance appears to be roughly constant.
summary(powerTransform(cbind(Time, Tonnage) ~ 1, glakes))
## bcPower Transformations to Multinormality
##
##         Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
## Time       0.0228   0.1930          -0.3554           0.4011
## Tonnage    0.2378   0.1237          -0.0046           0.4802
##
## Likelihood ratio tests about transformation parameters
##                               LRT df         pval
## LR test, lambda = (0 0)  3.759605  2 1.526202e-01
## LR test, lambda = (1 1) 45.315290  2 1.445140e-10
cargo_bcmod <- lm(log(Time) ~ I(Tonnage^.25), data = glakes)
The mean function appears to be linear and the residuals are well-approximated by the normal distribution.
There are no influential points and the variance appears to be roughly constant.
If we apply a log transformation to both the response and the predictor, then
\[\% \Delta Y \approx \beta_1 \times \% \Delta x\]
So, for every 1% increase in x, the model predicts a \(\beta_1 \%\) increase in Y
The approximation is only accurate when \(\beta_1\) is small (see p. 79 for details)
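A quick numeric check of the approximation (the slope value 0.5 is illustrative, not estimated from any of the data sets above):

```r
# In log(Y) = b0 + b1 * log(x), multiplying x by 1.01 (a 1% increase)
# multiplies Y by 1.01^b1. For small b1 this is close to a b1% increase.
b1 <- 0.5                            # illustrative slope, not from data
exact_pct  <- (1.01^b1 - 1) * 100    # exact % change in Y
approx_pct <- b1 * 1                 # approximation: b1 * (% change in x)
c(exact = exact_pct, approx = approx_pct)
```

Here the exact change is about 0.499% versus the approximate 0.5%, so the approximation is very close; for large \(\beta_1\) the gap widens.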